Detection of Reformulations in Spoken French
Abstract
Our work addresses the automatic detection of enunciations and segments containing reformulations in French spoken corpora. The proposed approach is syntagmatic: it relies on reformulation markers and on the specificities of spoken language. The reference data were built manually and finalized through consensus. Automatic methods, based on rules and on CRF machine learning, are proposed to detect the enunciations and segments that contain reformulations. The CRF models exploit different features within context windows of various sizes. Detection of enunciations with reformulations reaches up to 0.66 precision. Tests on the detection of reformulated segments indicate that the task remains difficult: the best average performance values reach up to 0.65 F-measure, 0.75 precision, and 0.63 recall. Future work includes improving the detection of reformulated segments and studying the data from other points of view.
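To make the idea of window-based features concrete, here is a minimal sketch in Python. It is not the authors' implementation. The marker list, the function names (`window_features`, `sequence_features`, `contains_marker`), the padding symbol, and the choice of lowercased word forms as the only feature are all assumptions made for illustration; the abstract does not specify the features or markers actually used. The sketch shows a rule-style check for a reformulation marker in an enunciation and the construction of one feature dictionary per token from a symmetric context window.

```python
from typing import Dict, List

# Hypothetical marker list for illustration only; the source does not enumerate its markers.
MARKERS = {"c'est-à-dire", "autrement dit", "je veux dire"}


def contains_marker(tokens: List[str]) -> bool:
    """Rule-style check: does the enunciation contain one of the markers?"""
    # Pad with spaces so that markers only match on whole-token boundaries.
    text = " " + " ".join(t.lower() for t in tokens) + " "
    return any(f" {m} " in text for m in MARKERS)


def window_features(tokens: List[str], i: int, size: int = 2) -> Dict[str, str]:
    """Build features for token i from the tokens within +/- `size` positions."""
    feats = {"bias": "1"}
    for offset in range(-size, size + 1):
        j = i + offset
        key = f"tok[{offset:+d}]"
        # Positions outside the enunciation get an assumed padding symbol.
        feats[key] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return feats


def sequence_features(tokens: List[str], size: int = 2) -> List[Dict[str, str]]:
    """One feature dictionary per token, for a sequence labeler such as a CRF."""
    return [window_features(tokens, i, size) for i in range(len(tokens))]


if __name__ == "__main__":
    enunciation = ["il", "est", "parti", "c'est-à-dire", "il", "a", "quitté"]
    print(contains_marker(enunciation))
    for feats in sequence_features(enunciation, size=1):
        print(feats)
```

Varying `size` corresponds to the different window sizes mentioned in the abstract. Per-token feature dictionaries of this kind are a common input format for CRF toolkits, though which toolkit was used here is not stated.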
Similar resources
Detection and Analysis of Paraphrastic Reformulations in Spoken Corpora (Repérage et analyse de la reformulation paraphrastique dans les corpus oraux) [in French]
Our work addresses the automatic detection of paraphrastic rephrasing in spoken corpora. The proposed approach is syntagmatic. It is based on paraphrastic rephrasing markers and the specificities of spoken language. Manual annotation performed by two annotators provides a fine-grained and multi-dimensional description of the reference data. An automatic method is proposed in order to decide wheth...
Paraphrastic Reformulations in Spoken Corpora
Our work addresses the automatic detection of paraphrastic reformulation in French spoken corpora. The proposed approach is syntagmatic. It is based on specific markers and the specificities of the spoken language. Manual multi-dimensional annotation performed by two annotators provides fine-grained reference data. An automatic method is proposed in order to decide whether sentences contain or ...
Automatic detection and annotation of disfluencies in spoken French corpora
In this paper we propose a multi-step system for the semiautomatic detection and annotation of disfluencies in spoken corpora. A set of rules, statistical models and machine learning techniques are applied to the input, which is a transcription aligned to the speech signal. The system uses the results of an automatic estimation of prosodic, part-of-speech and shallow syntactic features. We pres...
A methodology for the automatic detection of perceived prominent syllables in spoken French
Prosodic transcription of spoken corpora relies mainly on the identification of perceived prominence. However, the manual annotation of prominent phenomena is extremely time-consuming and varies greatly from one expert to another. Automating this procedure would be of great importance. In this study, we present the first results of a methodology aiming at an automatic detection of prominence sy...
DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech
We present DisMo, a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. DisMo is a hybrid system that uses a combination of lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). In this paper, we present the first public version of DisMo for ...